ABSTRACT

This paper presents a novel method for extracting acoustic
features that characterise the background environment in audio
recordings. These features are based on the output of an
alignment that fits multiple parallel background-based
Constrained Maximum Likelihood Linear Regression transformations
asynchronously to the input audio signal. With this setup, the
resulting features can track changes in the audio background,
such as the appearance and disappearance of music, applause or
laughter, independently of the speakers in the foreground of the
audio. The ability to provide this type of acoustic description
in audiovisual data has many potential applications, including
automatic classification of broadcast archives or improving
automatic transcription and subtitling. In this paper, the
performance of these features in a genre identification task on
a set of 332 BBC shows is explored. The proposed
background-tracking features outperform short-term Perceptual
Linear Prediction features in this task using Gaussian Mixture
Model classifiers (72% vs 62% accuracy). The use of more complex
classifiers, Hidden Markov Models and Support Vector Machines,
increases the performance of the system with the novel
background-tracking features to 79% and 81% accuracy
respectively.

Index Terms - Acoustic background, genre identification,
broadcast data.

1. INTRODUCTION 

The media domain presents many opportunities for the application
of speech technologies. With the volume of audiovisual data
growing every day due to digital television, social media and
on-line streaming, there is a great need for automatic
processing of this type of data. Possible applications include
automatic transcription and subtitling, classification of
audiovisual archives and acoustic information retrieval. Further
research in this area is also being pushed by initiatives like
the MediaEval Benchmarking for Multimedia Evaluation [?], which
covers several of these tasks in the multimedia domain. The
technologies required cover the whole range of speech
technologies: Automatic Speech Recognition (ASR), speaker
identification, diarisation, identification of acoustic events,
etc.

The ability to automatically detect the genre of a broadcast
show falls within the set of potential applications of speech
technologies that could become useful within the media domain.
While genres are subjective divisions usually defined depending
on the content of the show, shows belonging to the same genre
will share similar acoustic conditions that can be detected
using automatic speech processing. In this context, multimodal
approaches, merging features from audio and video processing,
have been very commonly used [?] and have consistently provided
results above 90-95% accuracy. Regarding the type of acoustic
features used, from the early works the focus of research has
been on the use of short-term features [?], including
Mel-Frequency Cepstral Coefficient (MFCC) features [?]. A full
evaluation of the use of MFCCs and Gaussian Mixture Model (GMM)
classifiers across 3 different test sets achieved 86% in a RAI
dataset, 78% in a Quaero dataset and 58% in a YouTube dataset
[?]. Also using MFCCs and GMMs, other authors achieved 94%
accuracy in the RAI dataset when processing whole shows and 82%
on segments as short as 6 seconds [?].

The different performances across sets indicate that short-term
spectral features present solid classification abilities, but
are not robust in heterogeneous and complex datasets. MFCCs, as
well as Perceptual Linear Prediction (PLP) features [?],
represent the short-term characteristics of speech, like the
spectral properties of phonemes and speakers, but are not
designed to characterise long-term properties of audio. This
could explain why, in homogeneous datasets, where shows and
speakers might often recur, like episodes from the same TV
series or broadcast news programmes, MFCCs performed
outstandingly. A solution to this lack of robustness was
proposed using Factor Analysis (FA) to extract factors related
to the genre, achieving 50% improvement over the use of MFCC
features on Internet videos [10].

Other approaches to this task [?] aim to identify specific
audiovisual events that can be used as semantic blocks to
understand the narrative of the overall show or video. However,
this is a more complex task, due to the need to identify very
subtle events, and its performance still does not match the
works previously mentioned in genre identification.



The work in this paper aims to provide a novel set of long-term
background-tracking features that can perform a more natural
description of the type of acoustic background present, also
tracking its temporal variations. In order to have robust genre
classification abilities, these features should be able to
represent different background conditions that can characterise
shows, like studio recordings, outdoor noises, applause,
laughter, different types of music, etc. On the other hand, to
ensure generalisation in the genre classification task, the
features should factor out the influence of the speaker and the
foreground. The proposal explored in this work arises from the
output of an asynchronous factorisation of background and
speaker with feature transformations, previously used in an ASR
task [13].

This paper is organised as follows: Section 2 will present the
audio processing system used to extract the background-tracking
features from audio files. Section 3 will describe the
experimental setup designed to perform genre identification in a
set of broadcast shows from the BBC. Finally, Sections 4 and 5
will present the results and conclusions of this work.

2. BACKGROUND-TRACKING FEATURES 

In [13], a novel method was presented to perform asynchronous
factorisation of background and speaker in ASR tasks. This
method relied on using a set of Constrained Maximum Likelihood
Linear Regression (CMLLR) transformations [?] characterising
different possible background conditions that were switched
asynchronously in the training and decoding process. As a
byproduct, applying this set of background transformations
asynchronously on a given audio segment will yield a sequence of
states that indicates which CMLLR transform was applied in each
frame and, hence, which corresponding background was considered
to be more likely.

The first step in extracting the proposed background-tracking
features is to use a previously trained Hidden Markov Model
(HMM) to align the input audio data to its transcription, or to
the output of a previous decoding if the transcription is not
available, using a set of asynchronous CMLLR transformations
trained to represent different background conditions. The
sequence of transformations applied in the best path from the
alignment can be written into a vector
$X = \{x(0), x(1), \ldots, x(n), \ldots, x(N-1)\}$, with $N$
being the length of the input audio signal in frames and each
value $x(n)$ given by the index assigned to each background
CMLLR transformation from a fixed set of values
$\{0, 1, \ldots, T-1\}$, where $T$ is the total number of
background CMLLR transforms. Indicator functions $c_t(n)$, as
defined in Equation (1), can be used to identify whether the
value of $x(n)$ is $t$ or not.

$$c_t(n) = \begin{cases} 1 & \text{if } x(n) = t \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Fig. 1. Background-tracking feature extraction.

The new feature vector proposed in this work is denoted as
$v(m)$ and can be calculated as the moving average of the
indicator functions $c_t(n)$ over a span of $P$ of the original
frames. The resulting feature sequence has a length of
$M = N/P$ and a dimension of $T$, being formed by all the values
$v_t(m)$, computed as in Equation (2), generating
$v(m) = \{v_0(m), v_1(m), \ldots, v_t(m), \ldots, v_{T-1}(m)\}$.

$$v_t(m) = \frac{1}{P} \sum_{p=0}^{P-1} c_t(m \cdot P + p) \qquad (2)$$

A graphical description of how this process is done can be seen
in Figure 1. In this example, there are $T = 4$ possible
background transformations, and values are aggregated every
$P = 12$ frames of the original input vector $x(n)$ generated as
output of the asynchronous alignment.
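The aggregation described in this section can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function name `background_tracking_features` is hypothetical, and dropping trailing frames that do not fill a complete block of $P$ frames is an assumption, since the text does not say how a remainder is handled.

```python
from typing import List, Sequence


def background_tracking_features(x: Sequence[int], T: int, P: int) -> List[List[float]]:
    """Block-average the indicator functions c_t(n) of a state sequence x.

    x holds one background-transform index (0..T-1) per frame; every P
    frames are collapsed into one T-dimensional vector v(m).
    """
    if P <= 0 or T <= 0:
        raise ValueError("T and P must be positive")
    M = len(x) // P  # assumption: an incomplete final block is dropped
    features = []
    for m in range(M):
        counts = [0] * T
        for p in range(P):
            t = x[m * P + p]
            if not 0 <= t < T:
                raise ValueError(f"index {t} outside 0..{T - 1}")
            counts[t] += 1  # c_t(n) is 1 only for the transform used in frame n
        features.append([c / P for c in counts])
    return features


# Toy alignment output: T = 4 transforms, P = 12 frames per block, N = 24 frames.
x = [0] * 6 + [2] * 6 + [3] * 12
v = background_tracking_features(x, T=4, P=12)
print(v)  # [[0.5, 0.0, 0.5, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

Each output vector sums to one, so it can be read as the fraction of the block in which each background transform was selected.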

3. EXPERIMENTAL SETUP 

The experiments for the evaluation of the proposed
background-tracking features were carried out on a set of 332
shows, totalling 231 hours, broadcast by the BBC during the
first week of May 2008. These programmes were divided into the
following 8 genres according to an internal BBC classification:

• Advice: Consumer, DIY and property shows. 

• Children’s: Including cartoons and educational shows. 

• Comedy: Sit-coms and light entertainment shows. 

• Competition: Quiz shows and other contest shows. 

• Documentary: Including fly-on-the-wall shows. 

• Drama: Soap operas and other serialised dramas. 

• Events: Live events, sports and concerts. 

• News: Broadcast news and current affair shows. 


These genres are very heterogeneous, as the BBC classifies a
large number of subgenres. For instance, the “Events” genre
covers music shows as well as live sports, and the “Documentary”
genre covers nature documentaries as well as fly-on-the-wall
shows. Since the dataset contains all the BBC broadcasts from a
single week, covering all genres, it is a very complete scenario
for the evaluation of background characterisation and genre
identification techniques.

Fig. 2. 1-minute samples of background-tracking features for
different shows.

For the experiments, 285 shows were used for training and 47
shows were used for testing. The number of shows and amount of
time covered by each genre is presented in Table 1. The test set
was selected to provide equal coverage of genres and subgenres,
with each genre represented by around 3 hours of broadcast time,
except for documentaries, which have a larger representation due
to their multiple subgenres. Shows in the test set were also
classified depending on whether a previous instalment of the
same show appeared in the training set, as this indicated
whether speakers and environments appearing in the test set also
appeared in the training set. Of the 47 test shows, 28 had
previous instalments in the training set, with the remaining 19
shows being unique instances of a show in the whole set.

While full transcriptions were not available for these shows,
the closed-captioning subtitles that were broadcast with the
shows were available and were used to train HMMs for ASR in a
Hidden Markov Model Toolkit (HTK) setup [15] using a lightly
supervised training process [16]. Seven asynchronous CMLLR
transformations were originally trained on a modified version of
the WSJCAM0 corpus [?], used in adaptation experiments [?],
containing 7 types of acoustic background: clean speech,
classical music, contemporary music, applause, cocktail party
noise, traffic noise and wildlife noise. These transformations
were asynchronously retrained on the BBC dataset to represent
the different acoustic disturbances present in this data. The
asynchronous alignment required to extract background-tracking
features was also performed based on the existing subtitles with
$T = 7$ and $P = 100$, so a 7-dimensional feature vector was
extracted from each second of the input audio.

Table 1. Distribution of shows by genre.

               Train             Test
Genre          Shows   Time      Shows   Time
Advice            34    24.5 h       4    3.0 h
Children’s        45    18.5 h       8    3.0 h
Comedy            20     9.7 h       6    3.2 h
Competition       37    25.9 h       6    3.3 h
Documentary       41    29.8 h       9    6.8 h
Drama             19    14.4 h       4    2.7 h
Events            23    29.8 h       5    4.3 h
News              66    50.3 h       5    2.0 h
Total            285   203.0 h      47   28.3 h

An illustration of the output of the background-tracking feature
extraction can be seen in the images in Figure 2. These images
visualise the 7-dimensional feature vectors extracted as
explained earlier in this section. The samples represent 4
periods of one minute (60 frames) from 4 different shows. The
values of each of the 7 dimensions are represented by the size
of the 7 coloured bars in each frame. Figure 2(a) is one minute
of a broadcast news programme, where the background changes from
music to street noise to clean studio and ends with street
noise. Figure 2(b) is one minute of a music event show, where
the music changes from rock music to solo singing and then to
instrumental rock music. Figure 2(c) is one minute of a
historical documentary show that starts with bell sounds,
followed by a period of music, another period of clean speech,
and finishes with sounds of seaside and birds. Finally, Figure
2(d) is one minute of a light entertainment show that mixes
speech with long bursts of laughter.



4. RESULTS 

The first set of experiments was designed to evaluate the
performance of the proposed background-tracking features
compared to short-term features in the genre identification
task. Genre-based GMMs were trained with the feature vectors
extracted from all the shows in the training set belonging to
each genre. One set of GMMs was trained with 13-dimensional PLP
features extracted every 10 ms, and another set with
7-dimensional background-tracking features extracted every
second. First and second derivatives were also computed and
added to the feature vectors, for a total of 39 dimensions in
the PLP features and 21 dimensions in the background-tracking
features. The background-tracking features were tested under two
conditions, the first one assuming that the subtitles of the
shows in the test set were available for the alignment, and the
second one using the transcription provided by the ASR system to
do the alignment. The genre of each show in the test set was
chosen by selecting the GMM that maximised the overall
likelihood of all the input frames in the test show. The results
in terms of accuracy (number of correctly classified shows
divided by the total number of test shows) for different numbers
of Gaussians in the GMMs for both types of features are
presented in Figure 3.
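The maximum-likelihood decision rule described above can be sketched as follows. This is a minimal illustration, assuming diagonal-covariance GMMs whose parameters have already been trained; the names `frame_log_likelihood` and `classify_show` and the toy two-dimensional models are hypothetical and are not taken from the experiments.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A diagonal-covariance GMM as a list of (weight, means, variances) components.
Gmm = List[Tuple[float, Sequence[float], Sequence[float]]]


def frame_log_likelihood(frame: Sequence[float], gmm: Gmm) -> float:
    """log p(frame | gmm), using log-sum-exp over the mixture components."""
    logs = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for value, mu, var in zip(frame, means, variances):
            log_p -= 0.5 * (math.log(2.0 * math.pi * var) + (value - mu) ** 2 / var)
        logs.append(log_p)
    peak = max(logs)
    return peak + math.log(sum(math.exp(v - peak) for v in logs))


def classify_show(frames: Sequence[Sequence[float]], genre_gmms: Dict[str, Gmm]) -> str:
    """Return the genre whose GMM gives the highest total log-likelihood."""
    totals = {
        genre: sum(frame_log_likelihood(f, gmm) for f in frames)
        for genre, gmm in genre_gmms.items()
    }
    return max(totals, key=totals.get)


# Hypothetical 2-dimensional models, invented purely for illustration.
genre_gmms = {
    "News": [(1.0, [0.9, 0.1], [0.02, 0.02])],
    "Events": [(0.5, [0.1, 0.9], [0.02, 0.02]), (0.5, [0.5, 0.5], [0.05, 0.05])],
}
show = [[0.85, 0.15], [0.95, 0.05], [0.8, 0.2]]
print(classify_show(show, genre_gmms))  # News
```

Summing per-frame log-likelihoods treats the frames as independent given the genre, which is the usual assumption when a GMM scores a whole show.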

Background-tracking features outperformed PLPs in this task.
While the proposed features achieved up to 72.4% accuracy, PLPs
only reached 61.7% accuracy. In further analysis, PLPs required
a higher number of Gaussians (up to 1,024 and 2,048) to achieve
their best performance, while background-tracking features
required less model complexity. This was due to the long-term
nature of the background-tracking features, which were extracted
every second, instead of every 10 milliseconds. While a total of
73,528,233 frames were available for training the PLP GMMs, only
730,621 were available with the background-tracking features.
Figure 3 also shows that there was little difference between
using the subtitles or the decoding transcripts to extract the
background-tracking features in the test shows, indicating that
the feature extraction process was robust to the use of noisy
transcriptions in the asynchronous alignment. Following this,
all further experiments were based on the alignment to the
subtitles.

Fig. 3. Accuracy in genre identification of PLP and
background-tracking features with GMM classifiers (thicker lines
represent global accuracy, thinner lines represent accuracy for
repeated and non-repeated shows).

The final element for analysis is presented in thinner lines
around the main lines in Figure 3. These lines mark the
accuracies achieved in shows that have previous instalments in
the training set and the accuracy achieved in the rest of the
shows. PLP features presented a larger spread (represented by
the shaded area in the Figure) between these two types of shows,
15% to 20% difference in absolute accuracy across most of the
range of GMM sizes, while background-tracking features exhibited
a lower difference, 5% to 10% at most. For shows with previous
episodes in the training set, PLP features achieved 67.8%
accuracy, narrowing the gap to the 75.0% obtained with
background-tracking features for the same shows. However, for
the rest of the shows, PLPs only reached 52.6% accuracy, while
background-tracking features reached a more robust 68.4%. This
showed that short-term features were more sensitive to the
presence of known speakers and environments in the training set.
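The split into repeated and non-repeated shows amounts to computing the same accuracy measure on two subsets of the test set. A minimal sketch is given below; the function names and the four toy shows are hypothetical and do not correspond to results from the experiments.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy(truth: Sequence[str], predicted: Sequence[str]) -> float:
    """Correctly classified shows divided by the total number of shows."""
    if not truth:
        raise ValueError("no shows to score")
    correct = sum(1 for t, p in zip(truth, predicted) if t == p)
    return correct / len(truth)


def accuracy_by_recurrence(
    truth: Sequence[str], predicted: Sequence[str], seen_in_training: Sequence[bool]
) -> Dict[str, float]:
    """Accuracy over all shows and over the repeated / non-repeated subsets."""
    groups: Dict[str, List[Tuple[str, str]]] = {"all": [], "repeated": [], "non-repeated": []}
    for t, p, seen in zip(truth, predicted, seen_in_training):
        groups["all"].append((t, p))
        groups["repeated" if seen else "non-repeated"].append((t, p))
    return {
        name: accuracy([t for t, _ in pairs], [p for _, p in pairs])
        for name, pairs in groups.items()
        if pairs  # skip empty subsets instead of dividing by zero
    }


# Invented labels for illustration only.
truth = ["News", "News", "Drama", "Comedy"]
predicted = ["News", "Drama", "Drama", "News"]
seen = [True, False, True, False]
print(accuracy_by_recurrence(truth, predicted, seen))
# {'all': 0.5, 'repeated': 1.0, 'non-repeated': 0.0}
```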

Afterwards, more advanced classifiers were evaluated using
background-tracking features. Two experiments were set up to
study two aspects of classification: modelling of temporal
changes and discriminative methods. The first classifiers
evaluated were HMMs, which are generative classifiers like GMMs
but, unlike GMMs, also model temporal transitions among hidden
states existing in the input data. For these experiments, HMMs
with 8 states were found to provide the best performance and
were subsequently used. The Gaussian components in each state
and the transition probabilities among states were learnt using
a Maximum Likelihood (ML) approach [?] from all the input
feature vectors from the shows in the training data. The genre
of each test show was again selected by maximising the
likelihood.
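One way to obtain the likelihood of a show under an HMM is the forward algorithm in the log domain, sketched below. This is an assumption made for illustration: the text does not state whether the total (forward) likelihood or the best-path (Viterbi) likelihood was used, and the per-state emission log-likelihoods are taken here as precomputed inputs. The function names are hypothetical.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    peak = max(values)
    if peak == NEG_INF:
        return NEG_INF
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def forward_log_likelihood(
    log_init: Sequence[float],
    log_trans: Sequence[Sequence[float]],
    log_emit: Sequence[Sequence[float]],
) -> float:
    """log p(observations | HMM).

    log_init[i]     : log probability of starting in state i
    log_trans[i][j] : log probability of moving from state i to state j
    log_emit[n][j]  : log-likelihood of frame n under the emission of state j
    """
    n_states = len(log_init)
    alpha: List[float] = [log_init[j] + log_emit[0][j] for j in range(n_states)]
    for frame in log_emit[1:]:
        alpha = [
            log_sum_exp([alpha[i] + log_trans[i][j] for i in range(n_states)]) + frame[j]
            for j in range(n_states)
        ]
    return log_sum_exp(alpha)


# Toy 2-state model with invented probabilities.
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)], [math.log(0.1), math.log(0.9)]]
log_emit = [[math.log(0.8), math.log(0.2)], [math.log(0.8), math.log(0.2)]]
print(round(math.exp(forward_log_likelihood(log_init, log_trans, log_emit)), 3))  # 0.322
```

Classification would then pick the genre HMM with the highest value, in the same way as the GMM decision.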

Fig. 4. Accuracy in genre identification of GMM, HMM and SVM
classifiers using background-tracking features.

The second classifiers used at this stage were Support Vector
Machines (SVMs) [?]. SVMs are widely used discriminative
classifiers and had been previously used in the genre
identification task [?]. In these experiments, the inputs to the
SVM classifier were supervectors obtained by concatenating the
Gaussian means of show-based GMMs trained via Maximum A
Posteriori (MAP) adaptation [21]. Gaussian-kernel SVMs were
trained [22] for each genre to classify whether or not shows
belonged to that genre. The final decision for each test show
was the genre whose SVM gave the best score among all the
genre-based SVMs.
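The construction of a supervector can be sketched as below. Several details are assumptions, since the text does not give them: the prior model is taken to be a single diagonal-covariance GMM shared by all shows, only the means are adapted, and the relevance factor of 16 is a commonly used default rather than a value from the experiments. The names `component_posteriors` and `map_supervector` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Component = Tuple[float, List[float], List[float]]  # (weight, means, variances)


def component_posteriors(frame: Sequence[float], gmm: Sequence[Component]) -> List[float]:
    """Posterior probability of each diagonal Gaussian component given one frame."""
    logs = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for value, mu, var in zip(frame, means, variances):
            log_p -= 0.5 * (math.log(2.0 * math.pi * var) + (value - mu) ** 2 / var)
        logs.append(log_p)
    peak = max(logs)
    exps = [math.exp(v - peak) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]


def map_supervector(
    frames: Sequence[Sequence[float]],
    prior_gmm: Sequence[Component],
    relevance: float = 16.0,  # assumed value, not taken from the paper
) -> List[float]:
    """MAP-adapt the means of prior_gmm to one show and concatenate them."""
    dim = len(prior_gmm[0][1])
    occupancy = [0.0] * len(prior_gmm)
    first_order = [[0.0] * dim for _ in prior_gmm]
    for frame in frames:
        for k, gamma in enumerate(component_posteriors(frame, prior_gmm)):
            occupancy[k] += gamma
            for d in range(dim):
                first_order[k][d] += gamma * frame[d]
    supervector = []
    for k, (_, means, _) in enumerate(prior_gmm):
        alpha = occupancy[k] / (occupancy[k] + relevance)
        for d in range(dim):
            data_mean = first_order[k][d] / occupancy[k] if occupancy[k] > 0 else means[d]
            # Interpolate between the data mean and the prior mean.
            supervector.append(alpha * data_mean + (1.0 - alpha) * means[d])
    return supervector


# Toy 1-dimensional prior with two components and 16 identical frames.
prior = [(0.5, [0.0], [1.0]), (0.5, [10.0], [1.0])]
sv = map_supervector([[1.0]] * 16, prior)
print([round(v, 3) for v in sv])  # [0.5, 10.0]
```

The resulting vector has one entry per Gaussian mean dimension, so its length grows with the number of Gaussians; one supervector per show would then be the input to the one-versus-rest SVMs.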

The results of the GMM, HMM and SVM classifiers are shown in
Figure 4 for different values of model complexity. They showed
that both HMMs and SVMs outperformed GMMs. The best result for
HMMs, 78.7% accuracy, was achieved with a total model complexity
of 256 Gaussians (8 states with 32 Gaussians each), while the
best result for SVMs, 80.9% accuracy, was achieved with a lower
model complexity, only 16 Gaussians.

To evaluate the identification abilities of the proposed
systems, the F-measures of the best HMM and SVM systems are
presented in Figure 5 for each genre. The F-measure, defined as
the harmonic mean of precision and recall for each class, allows
the accuracy and specificity of a classifier to be evaluated.
Figure 5 shows that SVMs performed better in identifying the
“Advice”, “Children’s”, “Events” and “News” genres, while HMMs
outperformed SVMs in the “Comedy”, “Competition” and “Drama”
genres.
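The per-genre F-measure defined above can be computed as in the following sketch. The function name and the toy labels are hypothetical; returning 0 when a class has no true positives is a convention chosen here to avoid division by zero.

```python
from typing import Dict, Sequence


def f_measure_per_class(truth: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Harmonic mean of precision and recall, computed separately per class."""
    scores = {}
    pairs = list(zip(truth, predicted))
    for label in sorted(set(truth) | set(predicted)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0  # convention: no true positives gives F = 0
    return scores


# Invented labels for illustration only; these are not results from the paper.
truth = ["News", "News", "Drama", "Drama", "Comedy"]
predicted = ["News", "Drama", "Drama", "Drama", "News"]
print(f_measure_per_class(truth, predicted))
```

In this toy case “News” has precision and recall of 0.5 each, “Drama” has precision 2/3 and recall 1, and “Comedy” is never predicted.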

Fig. 5. F-measure for HMM and SVM classifiers.

Finally, system combination based on the confidence scores given
by the best HMM and SVM systems was performed [?]. System
combination has traditionally been proposed as a solid way of
exploiting the outputs of different classifiers with different
properties; in this task, the modelling of dynamics given by
HMMs and the discriminative modelling provided by SVMs. The
confidence of the HMM classifier was based on the likelihood
score of the decided HMM, while the confidence score of the SVM
classifier was based on the distance score provided by the
decided SVM, both normalised to the range [0,1]. When both
systems provided the same hypothesis, it was accepted straight
away; when they disagreed, the output of the system with the
highest confidence was selected. The combination of both systems
achieved a global accuracy of 83.0%.
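The combination rule can be sketched as follows. The confidences are assumed to be already normalised to [0,1], since the normalisation method is not described in the text, and resolving exact ties in favour of the SVM is an arbitrary choice made for this sketch. The name `combine` is hypothetical.

```python
from typing import Tuple

Hypothesis = Tuple[str, float]  # (genre, confidence already normalised to [0, 1])


def combine(hmm: Hypothesis, svm: Hypothesis) -> str:
    """Accept agreement directly; otherwise trust the more confident system."""
    hmm_genre, hmm_conf = hmm
    svm_genre, svm_conf = svm
    for conf in (hmm_conf, svm_conf):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
    if hmm_genre == svm_genre:
        return hmm_genre
    # Assumption: an exact tie goes to the SVM; the text does not cover this case.
    return hmm_genre if hmm_conf > svm_conf else svm_genre


print(combine(("Drama", 0.7), ("Drama", 0.2)))   # Drama
print(combine(("Comedy", 0.9), ("News", 0.6)))   # Comedy
print(combine(("Comedy", 0.3), ("News", 0.6)))   # News
```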

5. CONCLUSIONS 

The proposed background-tracking features have shown, through a
range of different classifiers, that they can provide robust
results in the task of genre identification of broadcast shows.
While, in absolute terms, the combined use of acoustic and video
features has been reported to provide better performance [?],
the results are very promising when compared with previous
results using only acoustic features. Furthermore, some types of
broadcasts, such as radio or podcasts, do not have video and
rely only on the acoustics for classification. Future work will
merge these novel acoustic features with state-of-the-art video
features to compare with the best performing systems in this
task.

The experiments have also shown that the use of long-term
features outperforms usual short-term features in tasks that
require an acoustic characterisation of the background. Features
like PLPs or MFCCs have great classification capabilities in
speech but fail to generalise well, as shown by [?] in their
comparison of different datasets, because they mostly describe
the phonemes or speakers in the audio. Long-term
background-based features provide a more comprehensive
description of the acoustic conditions of broadcasts, and are
less sensitive to whether or not the same speakers and
environments recur.

There are many other tasks where the background-tracking
features could be exploited. In the future, these features can
be used to automatically split complete shows or videos into
homogeneous segments with a similar acoustic background. These
segments could be clustered by similarity and then used to let
users browse and link segments with a similar acoustic
background. From the point of view of speech technologies, it
remains to be explored how these features can be used in ASR
tasks in noisy conditions. Background-tracking features could be
used to adapt to or compensate for background noises and
disturbances, even when the background changes asynchronously,
enhancing ASR performance.